Cluster-Based Operating System-Agnostic Virtual Computing System

CROSS-REFERENCE TO RELATED APPLICATIONS 

[0001] This application claims the benefit of Provisional
Application No. 60/494,392, filed August 11, 2003, and of
Provisional Application No. 60/499,646, filed September 2, 2003.

REFERENCE TO COMPUTER PROGRAM LISTING APPENDIX 

[0002] A computer program listing appendix is submitted herewith
on one compact disc and one duplicate compact disc. The total
number of compact discs including duplicates is two. The files on
the compact disc are software object code and accompanying files
for carrying out the invention. Their names, dates of creation,
directory locations, and sizes in bytes are:

[0003] .CONFIG of August 27, 2003 located in the root folder and
of length 28,335 bytes;

[0004] BIOS.HEX of August 27, 2003 located in the root folder and
of length 241,664 bytes;

[0005] SCMPVMMO.HEX of August 27, 2003 located in the root folder
and of length 201,603 bytes;

[0006] SCMPVMMS.HEX of August 27, 2003 located in the root folder
and of length 20,119 bytes; and

[0007] USERMODE.HEX of August 27, 2003 located in the root folder
and of length 37,170 bytes.

[0008] The material on the compact discs is incorporated by
reference herein.

[0009] Installation and execution instructions for the material
on the compact discs are provided hereinbelow at Appendix 1.

BACKGROUND OF THE INVENTION

1. Field of the Invention. 

[0010] This invention relates to virtual computers. More
particularly, this invention relates to improvements in a
cluster-based symmetric multiprocessor.

2. Description of the Related Art. 

[0011] The meanings of certain acronyms and terminology used
herein are given in Table 1.



Table 1

API             Application programming interface
CPU             Central processing unit
DMA             Direct Memory Access - used by hardware devices,
                which are required to copy data to and from main
                system memory. DMA is used to relieve the CPU
                from waiting during memory accesses.
False sharing   In shared memory multiprocessors, when processors
                make references to different data items within
                the same block even though there is no actual
                dependence between the references.
FSB             Front-side bus
NIC             Network interface card
NUMA            Non-uniform memory access
PCI             Peripheral Component Interconnect - a standard
                for peripheral software and hardware interfaces.
SMP             Symmetric multiprocessor
TLB             Translation lookaside buffer
VM              Virtual machine
VMM             Virtual machine monitor

[0012] A portion of the disclosure of this patent document, which
includes a CD-ROM appendix, contains material that is subject to
copyright protection. The copyright owner has no objection to the
facsimile reproduction by anyone of the patent document or the
patent disclosure, as it appears in the Patent and Trademark
Office patent file or records, but otherwise reserves all
copyright rights whatsoever.

[0013] The use of virtual computers (generally referred to as
"virtual machines") to enhance computing power has been known for
several decades. For example, a classic system, VM, produced by
IBM, enabled multiple users to concurrently use a single computer
by running multiple copies of the operating system. Virtual
computers have been realized on many different types of computer
hardware platforms, including both single-processor and
multi-processor units.

[0014] Some virtual machine monitors are able to provide
concurrent support for diverse operating systems. This requires
the virtual machine monitor to present a virtual machine, that
is, a coherent view of the hardware, to each operating system.
The above-noted VM system has evolved to the point where it is
asserted that in one version, z/VM, available from IBM, New
Orchard Road, Armonk, NY, multiple operating systems can execute
on a single server.

[0015] Despite these achievements in virtual computing, practical
issues remain. The currently dominant personal computer
architecture, X86/IA32, which is used in the Intel Pentium™ and
other Intel microprocessors, is not conducive to virtualization
techniques for two reasons: (1) the instruction set of the CPU is
not natively virtualizable; and (2) the X86/IA32 architecture has
an open I/O architecture, which complicates the sharing of
devices among different operating systems. This has been an
impediment to continued advancements in the field. In general, it
is inefficient, and probably impractical, for multiple operating
systems to concurrently share common X86/IA32 hardware directly.
System features of the X86/IA32 CPU, e.g., paging and protection
mechanisms, and segmentation, are designed to be configured and
used in a coordinated effort by only one operating system.

[0016] Limitations of the X86/IA32 architecture can be
appreciated by a brief explanation of one known approach to
virtual computers, in which a virtual machine monitor is used to
provide a uniform execution environment within a computer. A
virtual machine monitor is a software layer that in this approach
is interposed between hardware of a single computer and one or
more guest operating systems that support different applications.
In this arrangement the virtual machine monitor interacts
directly with the hardware, and exposes an expected interface to
the guest operating systems. This interface includes normal
hardware facilities, e.g., CPU, I/O, and memory.

[0017] When virtualization is properly done, the guest operating
systems are unaware that they are interacting with a virtual
machine instead of directly with the hardware. For example, low
level disk operations invoked by the operating systems,
interaction with system timers, interrupts and exception handling
are all managed transparently by the guest operating systems via
the virtual machine monitor. To accomplish this, it is necessary
that the virtual machine monitor be able to trap and execute
certain hardware instructions dealing with the state of the
processor.

[0018] Significantly, the X86/IA32 employs four modes of
protected operation, which are conveniently conceptualized as
rings of protection, known as protection rings 0-3. Protection
ring 0 is the most protected, and was designed for execution of
the operating system kernel. Privileged instructions available
only under protection ring 0 include instructions dealing with
interrupt handling, and the modification of processor flags and
page tables. Typical examples are store instructions for the
global descriptor table (SGDT) and interrupt descriptor table
(SIDT). Protection rings 1 and 2 were designed for other
operating system services, e.g., device drivers. Protection
ring 3, the least privileged, was intended for applications, and
is also referred to as user mode. If it were possible to trap all
of the privileged X86/IA32 instructions in user mode, it would be
relatively straightforward for the virtual machine monitor to
handle them using ordinary exception-handling techniques.
Unfortunately, there are many privileged instructions of the
X86/IA32 instruction set, which cannot be trapped under
protection ring 3. Attempts to naively execute privileged
instructions under protection ring 3 typically result in a
general protection fault.
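
By way of non-limiting illustration, the trap-and-handle approach
discussed above, and the difficulty posed by instructions that do
not trap, can be sketched in Python. This is a toy model under
stated assumptions: the opcode names, the TRAPPING and
NON_TRAPPING sets, and the ToyCpu and ToyVmm classes are
hypothetical, are not taken from the computer program listing
appendix, and do not model the actual X86/IA32 instruction set.

```python
class GeneralProtectionFault(Exception):
    """Raised when user-mode code executes an instruction that traps."""

    def __init__(self, opcode, operand):
        super().__init__(opcode)
        self.opcode = opcode
        self.operand = operand


# Hypothetical opcodes: the first set faults in user mode, the second
# set touches processor state but executes without any fault.
TRAPPING = {"LOAD_PAGE_TABLE", "DISABLE_INTERRUPTS"}
NON_TRAPPING = {"READ_TABLE_BASE"}


class ToyCpu:
    def __init__(self):
        self.table_base = 0x1000  # real (host) value

    def execute_user_mode(self, opcode, operand=None):
        if opcode in TRAPPING:
            raise GeneralProtectionFault(opcode, operand)
        if opcode in NON_TRAPPING:
            return self.table_base  # exposes real state, no trap occurs
        return None  # ordinary instruction, nothing to model


class ToyVmm:
    """Runs guest code in user mode and emulates whatever traps."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_state = {"page_table": None, "interrupts_enabled": True}
        self.trap_count = 0

    def run(self, opcode, operand=None):
        try:
            return self.cpu.execute_user_mode(opcode, operand)
        except GeneralProtectionFault as fault:
            self.trap_count += 1
            return self.emulate(fault.opcode, fault.operand)

    def emulate(self, opcode, operand):
        # Apply the instruction to the virtual state, not the real CPU.
        if opcode == "LOAD_PAGE_TABLE":
            self.virtual_state["page_table"] = operand
        elif opcode == "DISABLE_INTERRUPTS":
            self.virtual_state["interrupts_enabled"] = False
        return None
```

In this sketch an instruction in TRAPPING reaches the monitor and
is applied to the virtual state, whereas an instruction in
NON_TRAPPING returns the real value without the monitor ever
being invoked, which is the kind of gap that emulation or binary
translation is intended to close.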

[0019] Because of the importance of the X86/IA32 architecture,
considerable effort has been devoted to overcoming its
limitations with regard to virtualization. Virtual machines have
been proposed to be implemented by software emulation of at least
the privileged instructions of the X86/IA32 instruction set.
Alternatively, binary translation techniques can be utilized in
the emulator. Binary translation techniques in connection with a
virtual machine monitor are disclosed in U.S. Patent
No. 6,397,242, the disclosure of which is incorporated herein by
reference. Additionally or alternatively, combinations of direct
execution and binary translation can be implemented. The open
source Bochs IA-32 Emulator, downloadable via the Internet at the
URL http://bochs.sourceforge.net/, is an example of a complete
emulator. Another example is the SimOS environment, available via
the Internet at the URL http://simos.stanford.edu/. The SimOS
environment is adapted to the MIPS R4000 and R10000 and Digital
Alpha processor families. Generally, the performance of emulators
is relatively slow.



49563 Ver. 49563Sll.doc 


[0020] Another known approach employs a hosted architecture. A
virtual machine application uses a VM driver to load a virtual
machine monitor at a privileged level. Typical of this approach
are the disclosures of U.S. Patent Nos. 6,075,938 and 6,496,847,
which are incorporated herein by reference. The virtual machine
monitor then uses the I/O services of a host operating system to
accommodate user-level VM applications. Current examples of this
approach include the VMware Workstation™ and the VMware GSX
Server™, both available from VMware, Inc., 3145 Porter Drive,
Palo Alto, CA 94304, and the Connectix Virtual PC™, available
from Microsoft Corporation, One Microsoft Way, Redmond,
WA 98052-6399. Another example is the open source Plex86 Virtual
Machine, available via the Internet at the URL
http://plex86.sourceforge.net/. The hosted architecture is
attractive due to its simplicity. However, it incurs a
performance penalty because the virtual machine monitor must
itself run as a scheduled application under the host operating
system, and could even be swapped out. Furthermore, it requires
emulators to be written and maintained for diverse I/O devices
that are invoked by the virtual machine monitor.

[0021] It is known in the art to use multiple processors in a
single computer in order to enhance overall system performance.
One known architecture is symmetric multiprocessing (SMP), in
which application programs are processed by multiple processors
that share a common operating system and memory. Typically, the
processors share memory and the I/O bus or data path, and are
controlled by a single instance of an operating system. In order
to enhance performance, SMP systems may employ non-uniform memory
access (NUMA), a method of configuring the microprocessors so
that they can share memory locally.

[0022] In a variation of multiprocessing systems, multiple
relatively small computers, either uniprocessors or
multiprocessors having relatively few processors, are linked
together and coordinated to execute multiple applications, while
serving one or more users. This arrangement is known as a
cluster, or scaled-out arrangement. Some systems of this type can
outperform corresponding SMP configurations. However, in the past
it has been necessary that applications for cluster-based systems
be specialized, so that they are cluster-aware. This has
increased development expense, and in some cases, has impeded the
use of standard commercial software on cluster-based systems.

[0023] An unsuccessful attempt to implement a VM computing
paradigm on cluster-based systems is disclosed in the document
The Memory and Communication Subsystem of Virtual Machines for
Cluster Computing, Shiliang Hu and Xidong Wang, Jan. 2002 (Hu et
al.), published on the Internet at the URL
http://www.cs.wisc.edu/~wxd/report/ece902.pdf. In this proposed
arrangement, multiple SMP clusters of NUMA-like processors are
monitored by virtual machine monitors. A cluster interconnect
deals with message passing among the clusters. The system
consists of multiple virtual machines that operate under a single
operating system, and support parallel programming models. While
a virtual computer built according to this paradigm would
initially appear to be highly scalable, preliminary simulations
of the communication and memory subsystems were discouraging. A
further difficulty is posed by limitations of current operating
systems, which are generally unaware of the locality of NUMA-type
memory. According to Hu et al., the proposed paradigm could not
be reduced to practice until substantial technological changes
occur in the industry. Thus Hu et al. appears to have encountered
a well-known difficulty: cluster machines generally, and NUMA
machines in particular, can be scaled up successfully only if
some way is found to ensure a high computation to communication
ratio in regard to both data distribution and explicit
communication among the clusters and processors.

[0024] The most successful of the solutions noted above have
either relied upon revisions and optimizations of the underlying
computer hardware, in the case of the IBM z/VM product, in order
to overcome the issues encountered by Hu et al. and to increase
performance generally, or have required kernel modifications of
operating system software, in the case of the above-noted VMware
products. These approaches are costly in terms of product
development, marketing, and maintenance, and often commercially
impracticable, due to secrecy policies of operating system
software vendors.

SUMMARY OF THE INVENTION 

[0025] According to a disclosed embodiment of the invention, an
improved cluster-based collection of computers (nodes) is
realized using unmodified conventional computer hardware and
unmodified operating system software. Software is provided that
enables a virtual machine to be presented to a guest operating
system, wherein each node participating in the virtual machine
has its own emulator or virtual machine monitor. VM memory
coherency and I/O coherency are provided by hooks, which result
in the manipulation of internal processor structures. A private
network provides communication among the nodes.

[0026] The invention provides a method for executing a software
application in a plurality of computing nodes having node
resources, wherein the nodes include a first node and a second
node that intercommunicate over a network, and the nodes are
operative to execute a virtual machine that runs under a guest
operating system. The method is carried out by running at least a
first virtual machine implementer and a second virtual machine
implementer on the first node and the second node, respectively,
and sharing the virtual machine between the first virtual machine
implementer and the second virtual machine implementer.

[0027] An aspect of the method includes running the software
application over the guest operating system, so that commands
invoked by the software application are monitored or emulated by
the first virtual machine implementer and by the second virtual
machine implementer on the first node and the second node, while
the node resources of the first node and the second node are
shared by communication over the network.

[0028] According to an additional aspect of the method, at least
one of the first virtual machine implementer and the second
virtual machine implementer is a virtual machine monitor.

[0029] According to one aspect of the method, at least one of the
first virtual machine implementer and the second virtual machine
implementer is an emulator.

[0030] According to still another aspect of the method, at least
the first node has a first virtual node that includes a first
physical CPU of the first node and has a second virtual node that
includes a second physical CPU of the first node.

[0031] According to another aspect of the method, there are a
plurality of virtual machines including a first virtual machine
and a second virtual machine, wherein the first virtual machine
and the second virtual machine have a plurality of virtual CPU's
that are virtualized by the first virtual machine implementer in
the first node based on a first physical CPU and by the second
virtual machine implementer in the second node based on a second
physical CPU, respectively.

[0032] According to yet another aspect of the method, a first
virtual node includes the first physical CPU and the second
physical CPU.

[0033] According to a further aspect of the method, the first
virtual machine implementer virtualizes at least one of the
virtual CPU's of the first virtual machine based on the first
physical CPU and virtualizes at least one of the virtual CPU's in
the second virtual machine based on the second physical CPU.

[0034] Another aspect of the method includes providing a
management system for the first virtual machine implementer and
the second virtual machine implementer to control the first node
and the second node, respectively, wherein the management system
includes a wrapper for receiving calls to a device driver from
the first virtual machine implementer, the wrapper invoking the
device driver according to a requirement of the first virtual
machine implementer.

[0035] A further aspect of the method includes providing a
virtual PCI controller for the management system to control a
physical PCI controller in one of the nodes.

[0036] Yet another aspect of the method includes providing a
virtual DMA controller for the management system to control a
physical DMA controller in one of the nodes.

[0037] Still another aspect of the method includes providing a
virtual PCI controller to control a physical PCI controller in
one of the nodes, and during a bootup phase of operation scanning
a device list with the virtual PCI controller to remap memory
regions and resources and identify devices having on-board DMA
controllers.
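
By way of non-limiting illustration, a bootup scan of the kind
described in the preceding paragraph might be sketched in Python
as follows. The PciDevice record, the scan_devices function, the
guest_base parameter, and the simple back-to-back remapping
policy (which ignores alignment) are all assumptions made for
this sketch; the disclosure does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PciDevice:
    """Hypothetical entry of the device list seen during bootup."""
    name: str
    regions: List[Tuple[int, int]]  # (physical base address, size) pairs
    has_dma: bool = False           # True if the device has on-board DMA


def scan_devices(devices, guest_base):
    """Scan the device list once.

    Returns a remap table keyed by (device name, region index) whose
    values are (physical base, remapped base, size), together with the
    names of the devices that have on-board DMA controllers.
    """
    remap = {}
    dma_devices = []
    next_free = guest_base
    for dev in devices:
        for index, (phys_base, size) in enumerate(dev.regions):
            # Assumed policy: pack regions back to back from guest_base.
            remap[(dev.name, index)] = (phys_base, next_free, size)
            next_free += size
        if dev.has_dma:
            dma_devices.append(dev.name)
    return remap, dma_devices
```

The resulting list of DMA-capable devices is the information a
virtual DMA controller would need in order to know which devices
can write to memory without CPU involvement.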

[0038] In one aspect of the method the virtual machine
implementer maintains mirrors of a memory used by the guest
operating system in each of the nodes, the method further
including write-invalidating at least a portion of a page of the
memory in one of the nodes, and transferring a valid copy of the
portion of the page to the one node from another of the nodes via
the network.

[0039] The invention provides a computer software product,
including a computer-readable medium in which computer program
instructions are stored, which instructions, when read by a
computer, cause the computer to perform a method for executing a
software application in a plurality of computing nodes having
node resources, wherein the nodes include a first node and a
second node that intercommunicate over a network, and the nodes
are operative to execute a virtual machine that runs under a
guest operating system. The method is carried out by running at
least a first virtual machine implementer and a second virtual
machine implementer on the first node and the second node,
respectively, and sharing the virtual machine between the first
virtual machine implementer and the second virtual machine
implementer.

[0040] The invention provides a computer system for executing a
software application, including a plurality of computing nodes,
the plurality of computing nodes including at least a first node
and a second node, a network connected to the first node and the
second node providing intercommunication therebetween, a first
virtual machine implementer and a second virtual machine
implementer executing on the first node and the second node,
respectively. The system further includes a virtual machine
implemented concurrently by at least the first virtual machine
implementer and the second virtual machine implementer, and a
guest operating system executing over the virtual machine,
wherein the software application executes over the guest
operating system, so that commands invoked by the software
application are received by the first virtual machine implementer
and the second virtual machine implementer on the first node and
the second node, while the node resources of the first node and
the second node are shared by communication over the network.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0041] For a better understanding of the present invention,
reference is made to the detailed description of the invention,
by way of example, which is to be read in conjunction with the
following drawings, wherein like elements are given like
reference numerals, and wherein:

[0042] Fig. 1 is a block diagram of a cluster-based virtual
computing arrangement that is constructed and operative in
accordance with a disclosed embodiment of the invention;

[0043] Fig. 2 is a detailed block diagram of a virtual machine
monitor that is constructed and operative in accordance with an
alternate embodiment of the invention;

[0044] Fig. 3 is a detailed block diagram of an alternate virtual
machine monitor that is constructed and operative in accordance
with an alternate embodiment of the invention;

[0045] Fig. 4 is a block diagram of a cluster-based virtual
computing arrangement employing multiprocessors and virtual nodes
in which there are a plurality of virtual machine implementers
per node that is constructed and operative in accordance with an
alternate embodiment of the invention;

[0046] Fig. 5 is a block diagram of a cluster-based virtual
computing arrangement employing multiprocessors and virtual nodes
having a plurality of virtual machine implementers per CPU that
is constructed and operative in accordance with an alternate
embodiment of the invention; and

[0047] Fig. 6 is a block diagram of a cluster-based virtual
computing arrangement that employs a virtual machine monitor
having a management system, that is constructed and operative in
accordance with an alternate embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0048] In the following description, numerous specific details
are set forth in order to provide a thorough understanding of the
present invention. It will be apparent to one skilled in the art,
however, that the present invention may be practiced without
these specific details. In other instances well-known circuits,
control logic, and the details of computer program instructions
for conventional algorithms and processes have not been shown in
detail in order not to unnecessarily obscure the present
invention.

[0049] Software programming code, which embodies aspects of the
present invention, is typically maintained in permanent storage,
such as a computer readable medium. In a client/server
environment, such software programming code may be stored on a
client or a server. The software programming code may be embodied
on any of a variety of known media for use with a data processing
system. This includes, but is not limited to, magnetic and
optical storage devices such as disk drives, magnetic tape,
compact discs (CD's), digital video discs (DVD's), and computer
instruction signals embodied in a transmission medium with or
without a carrier wave upon which the signals are modulated. For
example, the transmission medium may include a communications
network, such as the Internet.

Introductory Comments. 

[0050] A virtual node is the combination of a dedicated memory
segment, a dedicated device group (which can contain no devices),
and at least one CPU. A virtual machine implementer, such as a
virtual machine monitor or machine emulator or simulator,
disguises the virtual machine, so that an operating system that
issues calls to the virtual machine can use only the virtual node
resources.

[0051] A virtual CPU is an object that appears to be a CPU from
the perspective of a virtual machine. The operating system is
unaware that it is controlling a virtual CPU rather than a
physical CPU. The virtual machine implementer can replace the CPU
context with several virtual CPU contexts, hence virtualizing
more than one CPU based on one physical CPU.
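
By way of non-limiting illustration, the two definitions above
can be expressed as simple data structures in Python. The class
and field names, the single "pc" register, and the round-robin
switching policy are assumptions made for this sketch and are
not taken from the computer program listing appendix.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VirtualCpuContext:
    """Saved state of one virtual CPU (the register set is illustrative)."""
    vcpu_id: int
    registers: Dict[str, int] = field(default_factory=lambda: {"pc": 0})


@dataclass
class VirtualNode:
    """A dedicated memory segment, a device group, and at least one CPU."""
    memory_segment: range      # dedicated address range
    physical_cpus: List[int]   # identifiers of the CPUs in the node
    devices: List[str] = field(default_factory=list)  # may be empty

    def __post_init__(self):
        if not self.physical_cpus:
            raise ValueError("a virtual node needs at least one CPU")


class PhysicalCpu:
    """One physical CPU whose context is swapped among virtual CPUs."""

    def __init__(self, contexts):
        self.contexts = list(contexts)
        self.current = 0  # index of the context presently loaded

    def step(self):
        """Run one instruction of the loaded virtual CPU, then switch."""
        ctx = self.contexts[self.current]
        ctx.registers["pc"] += 1
        # Assumed policy: plain round-robin over the saved contexts.
        self.current = (self.current + 1) % len(self.contexts)
        return ctx.vcpu_id
```

With two contexts loaded on one PhysicalCpu, successive calls to
step() alternate between the two virtual CPU's, each of which
advances its own program counter independently.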

Embodiment 1.

[0052] Turning now to the drawings, reference is initially made
to Fig. 1, which is a block diagram of a cluster-based virtual
computing system 10 that is constructed and operative in
accordance with a disclosed embodiment of the invention. A
plurality of user applications 12, 14, 16 execute simultaneously,
supported by a guest operating system 18, which can be any
conventional unmodified operating system supported by the
instruction set architecture (ISA) of a plurality of nodes 22,
24, 26, e.g., Microsoft Windows®, Unix®, Linux®, or Solaris® X86
in the case of the X86/IA32 ISA. The guest operating system 18
controls a virtual machine 20, which presents itself to the guest
operating system 18 as though it were a conventional real
machine. While the system 10 is disclosed with reference to the
X86/IA32 family of processors, the invention is not limited to
the X86/IA32 family of processors, but is applicable to other
computer architectures.

[0053] While three nodes are shown, the system 10 is scalable,
and any number of nodes may be present, depending on the needs of
a particular application and the performance desired. The
nodes 22, 24, 26 each comprise computer hardware 28, which in a
current embodiment uses the X86/IA32 ISA. Instructions of the
guest operating system 18 are distributed for execution among the
nodes 22, 24, 26 as though the system 10 were a single SMP
machine with NUMA-like shared memory. This "virtual SMP"
operation is transparent to the guest operating system 18 and to
the applications 12, 14, 16, which consequently benefit from
enhanced computing speed without having to be "cluster-aware."

[0054] The hardware 28 includes nodal memory 30 and may also be
provided with many other types of conventional personal computer
devices 32, for example, I/O devices and NIC's or other network
communications facilities. Different versions of the X86/IA32 ISA
compatible processor may be placed in different nodes, and
various other aspects of the computer hardware may vary in
different nodes. For example, the processor speed, bus speed,
memory configuration, and I/O facilities may vary among the
different nodes. It is only necessary that the different nodes
all support a common ISA. Even this limitation can be removed by
using a full machine emulator to emulate an ISA that differs from
the ISA of the system on which it is running.

[0055] The system 10 is not dependent on any particular virtual
machine implementation technique in any particular node. This
point is emphasized in the exemplary configuration shown in
Fig. 1, in which the nodes 22, 24 are provided with virtual
machine monitors 34, 36, which can differ in implementation
technique or hardware. For example, the virtual machine
monitors 34, 36 could be different products, such as the above
noted plex86, Xen (available via the Internet at the URL
www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads.html), VMware
Workstation, Microsoft Virtual Server, or any other similar
product. The node 26 does not have a virtual machine monitor.
Instead, it is virtualized by an emulator 38, which can be the
Bochs IA-32 Emulator.

[0056] One of the main functions of a virtual computer is
virtualized execution of the kernel code. Virtualized execution
means that the guest operating system 18 receives effectively the
same results from having its code executed on a virtual computer
as on a real computer. Code of the guest operating system 18 is
ultimately executed via the virtual machine 20 on the CPU's of
the hardware 28. Therefore, a core element in the functionality
of a virtual computer is the virtualization of the CPU
instructions, the execution of which would otherwise break the
virtualization and cause inconsistent operation or even total
breakdown of the guest operating system. To this end, virtualized
kernel code execution is performed in the virtual machine
monitors 34, 36, and emulated in the emulator 38. The virtual
machine monitors 34, 36 catch faults, exceptions and interrupts
generated in the hardware 28, whether arising in the CPU or in
other components of the hardware 28. The main task of the virtual
machine monitors 34, 36 is to handle the faults, exceptions and
interrupts in a manner that leads the guest operating system 18
to perceive that its own execution is as expected. Thus, the
virtual machine can be implemented using any combination of the
above-noted known techniques, e.g., virtual machine monitor,
emulation with or without binary translation, or combinations
thereof, or variants of a hosted architecture. The system 10 can
be constructed using different types of emulators and different
types of virtual machine monitors in many combinations.
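
By way of non-limiting illustration, the interchangeability of
monitors and emulators can be sketched in Python as a common
interface with two implementations. The three-field toy
instruction format, the class names, and the round-robin
distribution in run_on_cluster are assumptions made for this
sketch. The shared guest_state dictionary stands in for the
coherent memory view and deliberately glosses over the
internodal coherence mechanism.

```python
from abc import ABC, abstractmethod


class VirtualMachineImplementer(ABC):
    @abstractmethod
    def execute(self, guest_state, instruction):
        """Apply one guest instruction to the shared guest state."""


def _apply(guest_state, instruction):
    """Reference semantics of the toy instruction set."""
    op, reg, value = instruction
    if op == "set":
        guest_state[reg] = value
    elif op == "add":
        guest_state[reg] = guest_state.get(reg, 0) + value
    else:
        raise ValueError(f"unknown op {op!r}")


class ToyMonitor(VirtualMachineImplementer):
    """Counts the instructions it treats as faulting, then applies them."""
    SENSITIVE = {"set"}

    def __init__(self):
        self.faults_handled = 0

    def execute(self, guest_state, instruction):
        if instruction[0] in self.SENSITIVE:
            self.faults_handled += 1  # stands in for catching a fault
        _apply(guest_state, instruction)


class ToyEmulator(VirtualMachineImplementer):
    """Interprets every instruction in software."""

    def __init__(self):
        self.interpreted = 0

    def execute(self, guest_state, instruction):
        self.interpreted += 1
        _apply(guest_state, instruction)


def run_on_cluster(implementers, program):
    """Distribute instructions round-robin across the implementers."""
    guest_state = {}
    for i, instruction in enumerate(program):
        implementers[i % len(implementers)].execute(guest_state, instruction)
    return guest_state
```

In this sketch a cluster of two ToyMonitor instances and one
ToyEmulator yields the same guest state as a single ToyEmulator,
reflecting the requirement that the guest receive effectively
the same results regardless of the implementation technique used
in each node.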

[0057] Memory coherence among the nodes 22, 24, 26 is achieved by
a memory management module 40, which maintains copies of all
memory content on each instance of the memory 30, and maintains a
record of page or sub-page validations and invalidations.
Similarly, a single coherent I/O view is achieved by an I/O
management module 42. The details of the memory management
module 40 and the I/O management module 42 are disclosed in
further detail hereinbelow.
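
By way of non-limiting illustration, a write-invalidate scheme
with per-node mirrors and a validity record can be sketched in
Python as follows. The Node and MemoryManager classes, the
whole-page granularity, and the transfers counter (which stands
in for traffic on the network) are assumptions made for this
sketch; it is not the memory management module 40 itself.

```python
class Node:
    def __init__(self, node_id, num_pages):
        self.node_id = node_id
        self.pages = [b""] * num_pages    # local mirror of guest memory
        self.valid = [True] * num_pages   # validity record, one per page


class MemoryManager:
    """Toy write-invalidate coherence over whole pages."""

    def __init__(self, num_nodes, num_pages):
        self.nodes = [Node(i, num_pages) for i in range(num_nodes)]
        self.transfers = 0  # pages copied over the simulated network

    def write(self, node_id, page, data):
        writer = self.nodes[node_id]
        writer.pages[page] = data
        writer.valid[page] = True
        # Invalidate every other mirror of the page that was written.
        for other in self.nodes:
            if other is not writer:
                other.valid[page] = False

    def read(self, node_id, page):
        reader = self.nodes[node_id]
        if not reader.valid[page]:
            self._fetch(reader, page)
        return reader.pages[page]

    def _fetch(self, reader, page):
        # The last writer always holds a valid copy, so one exists.
        owner = next(n for n in self.nodes if n.valid[page])
        reader.pages[page] = owner.pages[page]
        reader.valid[page] = True
        self.transfers += 1
```

A page is copied between nodes only when a node reads a page
that another node has written since the reader's copy was last
valid, so repeated local reads generate no further transfers.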

[0058] A private network 44 provides rapid internodal
communication, which is necessary for diverse functions of the
virtual machine monitors 34, 36 and the emulator 38, including
operation of the memory management module 40, the I/O management
module 42, and processing of hardware and software interrupts
between the nodes 22, 24, 26. The private network 44 may be
realized using standard networking equipment. High-bandwidth,
low-latency network elements are used to boost performance.
Standard host operating system NIC drivers, for example Linux NIC
drivers, can be used to operate NIC's for the private network 44
as one of the devices 32 in each of the nodes 22, 24, 26. Other
NIC's may also be included among the devices 32 for guest
operating system outbound network communications beyond the
cluster of the system 10.
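
By way of non-limiting illustration, internodal traffic of the
three kinds named above (memory management, I/O management, and
interrupts) could be framed and dispatched as sketched below in
Python. The seven-byte header layout, the message type codes,
and the function names are hypothetical; the disclosure does not
specify a wire format for the private network 44.

```python
import struct

# Assumed header: message type (1 byte), source node (2 bytes),
# payload length (4 bytes), all in network byte order.
HEADER = struct.Struct("!BHI")

MSG_MEMORY, MSG_IO, MSG_INTERRUPT = 1, 2, 3


def encode(msg_type, source_node, payload=b""):
    """Build one frame: header followed by the raw payload."""
    return HEADER.pack(msg_type, source_node, len(payload)) + payload


def decode(frame):
    """Split a frame into (message type, source node, payload)."""
    msg_type, source_node, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return msg_type, source_node, payload


def dispatch(frame, handlers):
    """Route a frame to the handler registered for its message type."""
    msg_type, source_node, payload = decode(frame)
    return handlers[msg_type](source_node, payload)
```

A receiving node would register one handler per message type,
for example a handler that injects a forwarded interrupt into the
local virtual machine implementer.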

Virtual Machine Monitor. 

[0059] As shown in Fig. 1, the virtual machine monitor 34 runs on
bare hardware. It is capable of supporting one or more virtual
machines, but has the disadvantage that I/O devices must be
supported by this type of virtual machine monitor. Reference is
now made to Fig. 2, which is a detailed block diagram of an
alternate virtual machine monitor 46 that is constructed and
operative in accordance with a disclosed embodiment of the
invention, and which is suitable for use as the virtual machine
monitor 34 in the system 10 (Fig. 1), and in the other
embodiments of a virtual computing system disclosed herein. The
virtual machine monitor 46 either integrally includes, or can
access, a VM driver 48 that loads the virtual machine monitor 46
into kernel memory, so that it can run at a privileged level. The
virtual machine monitor 46 employs the services of an unmodified
full host operating system 47 to control the hardware 5. This
method of operation is similar to the approach of the above-noted
U.S. Patent No. 6,496,847, in which a user-level emulator accepts
commands from a virtual machine monitor via a specialized
system-level driver and processes these commands as remote
procedure calls. The emulator is able to issue host operating
system calls and thereby access the physical system devices via
the host operating system. The host operating system itself thus
handles execution of certain virtual machine instructions, such
as accessing physical devices. However, the technique of U.S.
Patent No. 6,496,847 is only disclosed with respect to a single
hardware node. The system 10 (Fig. 1) also differs from the
disclosure of the above-noted U.S. Patent No. 6,075,938, in which
the virtual machine monitor is only shown to run on bare
hardware, and to control a single multiprocessing computer.
Furthermore, the system disclosed in U.S. Patent No. 6,075,938
requires kernel modifications of the host operating system to
operate successfully. An implementation of the virtual machine
monitor 46 is found in the computer program listing appendix.

[0060] Reference is now made to Fig. 3, which is a detailed block
diagram of an alternate virtual machine monitor 54 that is
constructed and operative in accordance with a disclosed
embodiment of the invention. The virtual machine monitor 54 can
be used in any of the embodiments of a virtual computing system
disclosed herein. The virtual machine monitor 54 does not rely
upon the host operating system, but instead includes a management
system 56, which is mainly used during boot-up and for
coordinating private network communications during normal
operation.

[0061] The management system 56 maintains a virtual PCI
controller 58, which serves as a proxy between the guest
operating system and the physical PCI controllers. During
boot-up, the virtual PCI controller 58 collects hardware
information from the underlying hardware 5. Exploiting
flexibilities of the PCI specification, it rearranges the PCI
devices in the local node and throughout the cluster, using
virtual PCI-to-PCI bridges. The virtual PCI controller 58 also
ascertains that there are no conflicts in the I/O ports and
memory regions used by the physical PCI devices on the individual
hardware 5 or elsewhere in the cluster. Thus, the virtual PCI
controller 58 makes the separate PCI buses of the individual
nodes 22, 24, 26 (Fig. 1) appear to the guest operating system 18
as a single PCI address space, i.e., a single bridged virtual PCI
bus. Currently prevalent commodity operating systems do not
support multiple PCI buses. Nevertheless, in some embodiments,
the virtual PCI controller 58 may have the capability of
implementing multiple virtual PCI buses in anticipation that they
may be supported by future commodity operating systems.
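
By way of illustration only, the check for conflicting I/O port or
memory regions across the cluster might be sketched as follows. The
sketch is written in Python for brevity. The tuple layout and the
function name find_conflicts are assumptions made for this example
and are not taken from the computer program listing appendix.

```python
from typing import List, Tuple

# Hypothetical region record: (node id, start address, length in bytes).
Region = Tuple[int, int, int]


def find_conflicts(regions: List[Region]) -> List[Tuple[Region, Region]]:
    """Return every pair of regions whose address ranges overlap."""
    ordered = sorted(regions, key=lambda r: r[1])
    conflicts = []
    for i, first in enumerate(ordered):
        first_end = first[1] + first[2]
        for second in ordered[i + 1:]:
            if second[1] >= first_end:
                break  # sorted by start, so no later region can overlap
            conflicts.append((first, second))
    return conflicts
```

An empty result indicates that the regions collected from all nodes
can be presented in one address space without remapping; any pair
that is returned would have to be remapped before being exposed to
the guest operating system.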

[0062] Subsequently, the virtual PCI controller 58 serves as a
sniffer for PCI configuration actions taken by the guest
operating system, and tracks any changes made by the guest
operating system to the PCI devices' I/O ports and memory
regions. It respects such changes and forwards them to the PCI
host of the appropriate physical node. It is also responsible for
updating internal tables regarding I/O port and memory region
assignments within the cluster.

[0063] The virtual PCI controller 58 emulates hot-pluggable PCI
events for the guest operating system. This allows for dynamic
node addition and removal. If and when the physical hardware
generates hot-pluggable PCI events, it is the responsibility of
the virtual machine monitor 54 to forward these events to the
guest operating system.

[0064] The management system 56 includes a virtual DMA
controller 60, which is a virtual layer that is capable of
forwarding remote DMA requests between the guest operating system
and remote nodes. The virtual DMA controller 60 is implemented by
catching (intercepting) exceptions relating to reserved I/O ports
assigned to a corresponding physical DMA controller, which may be
a third party device. It is possible to differentiate DMA
operations that can be performed entirely locally from those in
which either or both of the device and the memory area are
remote. DMA operations that are entirely local are forwarded as
quickly as possible to a physical DMA controller of the local
hardware 5, and are performed with almost no delay. DMA
operations that involve memory and a device that do not reside on
the same node are handled by transferring remote pages to the
node where the device resides via the private network 44, and
executing the DMA operation on that node.
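
The routing decision described above can be illustrated with a
short Python sketch. It is a hypothetical model only: the names
DmaRequest and route_dma, and the action strings that are returned,
are assumptions introduced for this example and do not describe the
disclosed implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DmaRequest:
    """Hypothetical description of one DMA operation."""
    device_node: int  # node on which the physical device resides
    memory_node: int  # node currently holding the target memory pages


def route_dma(request: DmaRequest, local_node: int) -> List[str]:
    """Return the ordered actions a virtual DMA controller might take."""
    if request.device_node == local_node and request.memory_node == local_node:
        # Entirely local: hand straight to the physical DMA controller.
        return ["forward to local physical DMA controller"]
    actions = []
    if request.memory_node != request.device_node:
        # Memory and device are on different nodes: move the pages first.
        actions.append(
            f"transfer pages from node {request.memory_node} "
            f"to node {request.device_node} over private network"
        )
    actions.append(f"execute DMA on node {request.device_node}")
    return actions
```

For example, route_dma(DmaRequest(device_node=2, memory_node=0), 0)
yields a page transfer to node 2 followed by execution of the DMA
operation on node 2, whereas a request whose device and memory both
reside on the local node takes the fast path.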

[0065] In a normal PCI environment, multiple DMA controllers
exist concurrently, and different DMA controllers, i.e., "first
party" DMA controllers, may exist on different add-on cards.
Therefore, there needs to be a general solution to deal with the
multitude of controllers. Each card may have its own rules and
semantics for communicating with its respective DMA controller.
However, there are a few commonly used methods, each having its
own semantics. The virtual DMA controller 60 (Fig. 3) may provide
a high-level language for defining, in a unified manner, which
I/O ports, memory addresses, and sequences are required to be
intercepted by the virtual machine monitor 54. Such values are
monitored and recorded by the virtual machine monitor 54 during
normal operation.

[0066] When a DMA operation involving a first party DMA
controller is initiated, usually by writing a certain value to a
DMA controller port or memory register, the DMA operation is
performed and the memory is marked by the virtual DMA
controller 60 as invalid or locked on all other machines except
the machine on which the DMA controller resides. Once
notification of a successful DMA operation from a card is
detected in a virtual machine monitor, either by an interrupt or
by polling the appropriate I/O ports or memory ranges, that
memory is again marked as unlocked, and available for access by
remote machines. As an alternative optimization, an incoming DMA
operation, i.e., device to memory, may be performed into
predefined reserve memory, which is then copied to the guest
operating system memory area once the operation is completed.
This avoids locking the memory accessed by the DMA operation for
a long time.
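
The lock-and-release bookkeeping for first party DMA operations can
be sketched as follows. This is a minimal model under stated
assumptions: the class DmaLockTable and its method names are
hypothetical, pages are identified by plain integers, and the
notification of completion (interrupt or polling) is represented
simply by a call to complete_dma.

```python
class DmaLockTable:
    """Hypothetical record of pages locked by in-flight DMA operations."""

    def __init__(self):
        # page number -> node on which the DMA controller resides
        self.locked = {}

    def begin_dma(self, page, controller_node):
        # Lock the page on every node except the controller's own node.
        self.locked[page] = controller_node

    def complete_dma(self, page):
        # Completion was detected, so the page becomes available again.
        self.locked.pop(page, None)

    def may_access(self, page, node):
        owner = self.locked.get(page)
        return owner is None or owner == node
```

While an operation is in flight, may_access returns True only for
the node that hosts the DMA controller; after complete_dma it
returns True for every node.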

Bootup. 

[0067] When power is initially applied to a PCI device, the
hardware remains inactive. In other words, the device only
responds to configuration transactions. At power-on, the device
has no memory and no I/O ports mapped in the computer's address
space; every other device-specific feature, such as interrupt
reporting, is disabled as well. Fortunately, every PCI
motherboard is equipped with PCI-aware firmware: the BIOS. The
firmware offers access to the device configuration address space
by reading and writing registers in the PCI controller.

[0068] At system boot, the firmware or the OS, for example the
Linux kernel, performs configuration transactions with every PCI
peripheral in order to allocate a safe place for any address
region it offers. By the time a device driver accesses the
device, its memory and I/O regions have already been mapped into
the processor's address space. While a device driver can change
this default assignment, in practice this is not done.

[0069] The virtual PCI controller 58 takes control at this stage,
reading all of the device configuration data, storing it in one
node, e.g., a master node, and performing a remapping of all
regions and resources. After this remapping is completed, it is
delegated to the actual physical PCI controllers. The virtual PCI
controller 58 scans the device list, and deals specially with
certain device IDs that are known to have on-board DMA
controllers, e.g., IDE cards, NICs, and SCSI controllers. Such
DMA controllers are virtualized by the virtual DMA controller 60
so that DMA operations on these cards can take place.

[0070] Eventually, the management system 56 requests
configuration data for all devices, which is supplied by the
virtual PCI controller 58.

[0071] During normal operation the virtual PCI controller 58
continually tracks hardware configuration changes, including
requests by the guest operating system to map or remap hardware
regions. A table, mapping regions to actual node IDs, is
maintained and updated.
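
One possible shape for a table that maps regions to node IDs is
sketched below. The class RegionTable and its methods are
hypothetical names chosen for this illustration. The sketch assumes
that regions do not overlap and that a remap is expressed as an
unmap followed by a map.

```python
import bisect


class RegionTable:
    """Hypothetical map from address regions to the owning node ID."""

    def __init__(self):
        self._starts = []   # sorted region start addresses
        self._entries = {}  # start address -> (length, node ID)

    def map_region(self, start, length, node_id):
        if start not in self._entries:
            bisect.insort(self._starts, start)
        self._entries[start] = (length, node_id)

    def unmap_region(self, start):
        del self._entries[start]
        self._starts.remove(start)

    def node_for(self, address):
        """Return the node ID owning the address, or None if unmapped."""
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start = self._starts[i]
        length, node_id = self._entries[start]
        return node_id if address < start + length else None
```

A lookup by address tells a virtual machine monitor to which
physical node an intercepted access should be forwarded.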

Memory Coherence. 

[0072] Each virtual machine presents a single coherent shared
memory to the guest operating system, while physical memory 30
may be distributed across multiple nodes. To support this
functionality transparently to the guest operating system,
several techniques are used in different combinations, as may be
required to optimize the performance and reliability of a
particular cluster-based system.

[0073] Referring again to Fig. 1 and Fig. 3, in one embodiment
memory mirroring is used across all the nodes 22, 24, 26
(Fig. 1). Memory mirroring provides protection for memory
analogous to the protection afforded hard disk drives by RAID-1 
disk mirroring. Reliability may be enhanced by using Chipkill™ 
memory, available from IBM, New Orchard Road, Armonk, NY, which 
allows multiple errors to be corrected. Another technique that 
can be employed to enhance reliability is error-correcting code
(ECC) protection of data.

[0074] Page or sub-page validations and write-invalidations are
performed by the virtual machine monitor 34, and communicated to
the other nodes using the private network 44. When an invalid
page is required by a particular node, memory migration is
performed, originating from a node having a valid copy of that
page. As CPUs provide page-based memory access protection,
implementation of page level granularity is sufficient in most
cases. That is to say, page-size internodal memory transfers are
performed. In some cases, where only a portion of a page is
frequently invalidated, sub-page granularity can be achieved
adaptively using the same page level granularity mechanism with
additional software. This prevents false sharing and has the
additional benefit of reducing internodal traffic on the private
network 44.
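
The validation and write-invalidation scheme can be illustrated
with a simplified page directory. The sketch below is a generic
write-invalidate model and is an assumption of this example, not
the disclosed implementation: the class PageDirectory is a
hypothetical name, a single directory stands in for the messages
exchanged over the private network 44, and the choice of source
node for a migration is arbitrary.

```python
class PageDirectory:
    """Hypothetical record of which nodes hold a valid copy of each page."""

    def __init__(self, initial_holder):
        self.initial_holder = initial_holder  # node assumed to hold all pages
        self.valid = {}      # page -> set of nodes with a valid copy
        self.transfers = []  # log of (page, source node, destination node)

    def _holders(self, page):
        return self.valid.setdefault(page, {self.initial_holder})

    def read(self, page, node):
        holders = self._holders(page)
        if node not in holders:
            source = min(holders)  # any node with a valid copy would do
            self.transfers.append((page, source, node))  # page migration
            holders.add(node)      # validate the page on the reader

    def write(self, page, node):
        self.read(page, node)      # obtain a valid copy first
        self.valid[page] = {node}  # invalidate every other copy
```

Each entry in transfers corresponds to one page-size internodal
transfer, which is the traffic that sub-page granularity is
intended to reduce.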

[0075] Further aspects of the coherent memory system used in
embodiments of the present invention are described below in the
subsection entitled "Memory Management Subsystem."

Embodiment 2.

[0076] Reference is now made to Fig. 4, which is a block diagram
of a cluster-based virtual computing system 64 that is
constructed and operative in accordance with an alternate
embodiment of the invention. In this embodiment there are a
plurality of nodes 66, 68, 70 that are realized as multiprocessor
computer hardware, including memory 72, I/O devices 85 and at
least two CPUs 74, 76 per node. In one configuration of the
system 64, each CPU in a node is included in a different virtual
node, and is controlled by a different virtual machine. One
virtual machine implementer is thus capable of using one physical
CPU to virtualize a plurality of virtual CPUs.

[0077] The system 64 employs two guest operating systems 18, 19
to concurrently execute multiple applications 12, 13, 14, 15, 16,
17. Applications 12, 13, 14 are supported by the guest operating
system 18. Applications 15, 16, 17 are supported by the guest
operating system 19.

[0078] The guest operating systems 18, 19 control virtual
machines 86, 88, respectively. Each virtual machine has a
plurality of virtual CPUs 21. Three virtual CPUs are shown;
however, larger numbers of CPUs can be virtualized. Furthermore,
none of the nodes 66, 68, 70, the virtual nodes 90, 92 or the
virtual machines 86, 88 needs to be configured identically. In
fact, the virtual machines 86, 88 can have different numbers of
virtual CPUs. The virtual machines 86, 88 are provided with
virtual memory 23, and virtual I/O devices 25.

[0079] Two virtual machine implementers 78, 80 are included with
each of the nodes 66, 68, 70 to implement the virtual
machines 86, 88. The virtual machine implementers 78, 80 can be
virtual machine monitors or emulators in any combination. The
number of virtual machine implementers is only partially related
to the number of CPUs in a node. The number of virtual machine
implementers more closely relates to the implementation method
itself. For example, multiple emulators can run over one CPU.
Alternatively, each emulator can provide multiple virtual CPUs,
as is disclosed below (Embodiment 3).

[0080] A unit comprising the CPU 76 and a dedicated segment of
the memory 72 makes use of only part of the computing resource of
the hardware, such as a device group, and is known as a virtual
node. A virtual node may make use of one CPU of a multiprocessor,
or more. The node 68, for example, has two virtual nodes 90, 92,
which are enclosed by broken lines. The system 64 is flexible in
its ability to deal with I/O devices that are physically
distributed among the nodes 66, 68, 70 transparently to the guest
operating systems 18, 19. To support this functionality, in the
node 68 the virtual machine implementer 78 is associated with the
virtual node 90, and the virtual machine implementer 80 with the
virtual node 92. The I/O devices 85 in the node 68 may be
arbitrarily segmented into device groups 82, 84, which are
accessible to the virtual machines 86, 88 (in addition to the I/O
devices in the nodes 66, 70). The I/O devices 85 in the node 68
are also accessible by the nodes 66, 70, using the private
network 44. The device groups 82, 84 are controlled respectively
by the virtual machine implementers 78, 80. In the node 68, the
CPU 74 is controlled by the virtual machine implementer 78, the
virtual machine 86, and the guest operating system 18. The CPU 76
is controlled by the virtual machine implementer 80, the virtual
machine 88, and the guest operating system 19. Thus, two
operating systems simultaneously control one physical node.

Embodiment 3.

[0081] Reference is now made to Fig. 5, which is a block diagram
of a cluster-based virtual computing system 94 that is
constructed and operative in accordance with an alternate
embodiment of the invention. The system 94 is similar to the
system 64 (Fig. 4), but has even finer granularity. As in the
system 64, the system 94 is provided with nodes in which there is
more than one virtual machine implementer per physical node. In
addition, one physical CPU is used to virtualize a plurality of
virtual CPUs, which are distributed in the same or different
virtual nodes.

[0082] The system 94 has a node 69, which has a hardware
configuration that is identical to the node 68 (Fig. 4). However,
a virtual machine implementer 107 in the node 69 virtualizes the
CPU 74 and participates in a virtual machine 95. A virtual
machine implementer 109 virtualizes the CPU 76, and participates
in two virtual machines 95, 97. It will be noted that the virtual
machine 95 contains four virtual CPUs 21, while the virtual
machine 97 has three virtual CPUs 21. A virtual node 103 includes
the CPU 74 and shares the CPU 76 with another virtual node 105.
Thus, in the system 94, the CPU 76 participates in two virtual
nodes 103, 105, and is simultaneously controlled by the two guest
operating systems 18, 19. It is the role of the virtual machine
implementer to allow such coparticipation in an efficient way.

[0083] It is possible to configure the nodes of the system 94 in
many combinations. For example, all of the nodes may be
configured with a plurality of virtual CPUs per physical CPU,
which may belong to the same or different virtual nodes.
Furthermore, it is possible to increase the number of virtual
CPUs virtualized by one single processor beyond those shown in
the two virtual machines 95, 97, subject to practical limitations
of overhead. Furthermore, the number of virtual nodes sharing one
physical node can be increased, again subject to limitations of
overhead.

Embodiment 4.

[0084] Reference is now made to Fig. 6, which is a block diagram
of a cluster-based virtual computing system 120 in accordance
with a disclosed embodiment of the invention. A plurality of user
applications 12, 14, 16 execute simultaneously, supported by the
guest operating system 18, which can be any conventional
operating system, e.g., Microsoft Windows®, Linux®, Solaris® X86.
The guest operating system 18 controls the virtual machine 20,
which presents itself to the guest operating system 18 as though
it were a conventional real machine.

[0085] The system 120 has a plurality of nodes 122, 124, 126,
128. While four nodes are shown, the system 120 is scalable, and
any number of nodes may be present, depending on the needs of a
particular application and the performance desired. The
nodes 122, 124, 126, 128 each comprise computer hardware 28,
which in a current embodiment has the X86/IA32 architecture.
However, as noted above, the invention is not limited to the
X86/IA32 family of processors, but is applicable to other
computer architectures. The hardware 28 includes nodal memory 30,
and may also be provided with a NIC 130 or other network
communications facilities, and with many other types of
conventional personal computer I/O devices 132. The nodes 122,
124, 126, 128 may be identically configured. Alternatively,
different versions of the X86/IA32 processor may be placed in
different nodes. Other aspects of the computer hardware may also
vary among the nodes, e.g., processor speed, bus speed, memory
configuration, and I/O facilities.

[0086] In the nodes 122, 126, 128, each of the CPUs is provided
with a virtual machine monitor 134. The node 124 is provided with
two virtual machine monitors 136, 138, which share the resources
of the hardware 28, as shown in the foregoing embodiments.

[0087] In this embodiment, the virtual machine monitors 134, 136,
138 are driven entirely by interrupts, and do not schedule for
themselves any processing slots. They only react to actions taken
by the guest operating system 18 or by the applications 12, 14,
16, and to interrupts generated in the hardware 28.

[0088] The virtual machine monitors 134, 136, 138 have a flexible
policy for handling faults, exceptions and interrupts depending
on their individual characteristics. This may be effected by a
mechanism known as "scan before execute", which, as implied by
its name, scans the code prior to execution and causes software
interrupts to occur at the relevant places. Alternatively, the
policy may be effected by a mechanism known as dynamic
translation. Both of these techniques scan the code,
differentiating between code that can be run natively, i.e.,
directly on the hardware 28, and the code that should not be run
natively. For the latter, the code is altered either to generate
a trap to the virtual machine monitor or to jump directly to a
virtual machine monitor function. The virtual machine monitor can
then emulate a current instruction that should not be run
natively. These techniques yield reasonable efficiency, as in
practice most code can be run natively and only a small portion
needs to be emulated. Scanning the code prior to execution is not
expensive, as the same code is often run many times, in which
case only one scan is needed.
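
The principle of scanning a block once and reusing the result can
be illustrated with a toy model. The sketch below is an assumption
of this example and greatly simplifies real instruction decoding:
instructions are represented by mnemonic strings, the set SENSITIVE
is an illustrative subset only, and the class name Scanner is
hypothetical.

```python
class Scanner:
    """Toy "scan before execute" pass with a per-block cache."""

    # Illustrative subset of instructions that should not run natively.
    SENSITIVE = frozenset({"in", "out", "cli", "sti", "hlt"})

    def __init__(self):
        self.cache = {}  # block start address -> rewritten block
        self.scans = 0   # number of scans actually performed

    def prepare(self, address, instructions):
        """Return the block with sensitive instructions marked as traps."""
        if address not in self.cache:
            self.scans += 1
            self.cache[address] = [
                ("trap" if op in self.SENSITIVE else "native", op)
                for op in instructions
            ]
        return self.cache[address]
```

Calling prepare twice for the same block address performs only one
scan, which is why the cost of scanning is amortized when the same
code is run many times.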

[0089] In some cases, the X86/IA32 architecture permits passing
faults, exceptions and interrupts to the guest operating
system 18 without modification. In other cases, faults,
exceptions and interrupts may be hidden from the guest operating
system 18. In still other cases, faults, exceptions and
interrupts are processed internally by the virtual machine
monitors 134, 136, 138, which may direct subsequent actions to be
taken with respect to the guest operating system 18. For
instance, a new interrupt may be generated and sent to the guest
operating system 18 for processing. An interrupt is generated by
emulating the behavior of the CPU when it receives an interrupt.

[0090] For those instructions that require emulation or other
modification, an integrated machine emulator, which is part of
the virtual machine monitor, is used.

Memory Management Subsystem.

[0091] Continuing to refer to Fig. 6, memory coherence among the
memory 30 of the nodes 122, 124, 126, 128 is achieved by a memory
management subsystem 140, which is integrated in the virtual
machine implementers 134, 136, 138. The virtual machine
implementers 134, 136, 138 are each provided with a memory access
hook and I/O access for the memory management subsystem 140. The
private network 44 provides rapid internodal communication that
is necessary for the operation of the memory management
subsystem 140. When implemented as virtual machine monitors, the
virtual machine implementers 134, 136, 138 typically use a paging
mechanism to synchronize the memory 30. Memory caches are
established on different nodes 122, 124, 126, 128 in order to
allow faster access to recently used segments of the memory 30.

[0092] The virtual machine implementers 134, 136, 138 initialize
the memory management subsystem 140 using the call INIT(). During
initialization, the memory management subsystem 140 invalidates
all local pages of the memory 30 for read and write access.

[0093] During subsequent operation, the virtual machine
implementers 134, 136, 138 call the memory management
subsystem 140 in order to obtain read or write access to a
physical page that is currently marked as invalid for the
specified access type. The memory management subsystem 140 also
calls the virtual machine implementers 134, 136, 138 when
required in order to invalidate a page for a specified access
type, provided that the page should no longer be accessed by the
CPU in the hardware 28 for that particular type of access.
Alternatively, the page is validated for a specified access type
if it has become available for that type of access. The memory
management subsystem 140 requests page invalidation or validation
using a physical address. Virtual machine monitors that are used
as the virtual machine implementers 134, 136, 138 use a reverse
page lookup mechanism in order to update the processor paging
table and invalidate the processor translation lookaside buffer
(TLB). A description of the interface used for page access
control and retrieval by the memory management subsystem 140 is
found in Table 2.



Table 2.

INV_PAGE(PHY_ADD, RW)        Invalidate request for a physical
                             page using its physical address and
                             access type

VLD_PAGE(PHY_ADD, RW)        Validate request for a physical page
                             using its physical address and
                             access type

GET_PAGE(PHY_ADD, RW,        Get read or write access to physical
BUFFER, LENGTH, OP)          memory address using its physical
                             address and access type.
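
The effect of INV_PAGE and VLD_PAGE on a virtual machine monitor
that uses a reverse page lookup can be modeled as follows. The
sketch is hypothetical in every detail other than the call names
of Table 2: the class ShadowPaging, the 4096-byte page size, the
use of the strings "R" and "W" for the access type, and the
modeling of the TLB as a set of virtual page numbers are all
assumptions made for this illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this example


class ShadowPaging:
    """Toy paging state with a reverse (physical-to-virtual) lookup."""

    def __init__(self):
        self.page_table = {}  # virtual page -> [physical page, allowed access]
        self.reverse = {}     # physical page -> set of virtual pages
        self.tlb = set()      # virtual pages with cached translations

    def map(self, vpage, ppage):
        # No access is allowed until validated, mirroring INIT().
        self.page_table[vpage] = [ppage, set()]
        self.reverse.setdefault(ppage, set()).add(vpage)

    def translate(self, vpage):
        self.tlb.add(vpage)  # translation is now cached
        return self.page_table[vpage][0]

    def inv_page(self, phy_add, rw):
        """Revoke one access type for every mapping of a physical page."""
        for vpage in self.reverse.get(phy_add // PAGE_SIZE, ()):
            self.page_table[vpage][1].discard(rw)
            self.tlb.discard(vpage)  # flush the stale translation

    def vld_page(self, phy_add, rw):
        """Grant one access type for every mapping of a physical page."""
        for vpage in self.reverse.get(phy_add // PAGE_SIZE, ()):
            self.page_table[vpage][1].add(rw)
```

The reverse dictionary is what allows a request expressed as a
physical address to reach every virtual mapping of that page.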



[0094] In the function GET_PAGE, the parameter RW is a flag
indicating the type of access intended. The parameters BUFFER and
LENGTH are used to pass data in the case of a write operation and
return data for a read operation. In case of read-modify-write
operation, the function is called with the parameter RW set to a
value of RMW. The parameter OP is processor dependent, and would
thus be different in a processor outside the X86/IA32 family. It
can indicate any of several operations, for example, increment,
decrement, store and return previous value, and test and set.
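
The parameter semantics of GET_PAGE can be illustrated with a toy
model in which a bytearray stands in for physical memory. The
sketch covers only the meaning of RW, BUFFER, LENGTH and OP, and
omits the internodal acquisition of access. The Python names, the
string values used for OP, the single-byte operand, and the
convention that every read-modify-write operation returns the
previous value are assumptions of this example.

```python
from enum import Enum


class Access(Enum):
    READ = "R"
    WRITE = "W"
    RMW = "RMW"


def get_page(memory, phy_add, rw, buffer=None, length=1, op=None):
    """Toy GET_PAGE over a bytearray standing in for physical memory."""
    if rw is Access.READ:
        return bytes(memory[phy_add:phy_add + length])
    if rw is Access.WRITE:
        memory[phy_add:phy_add + length] = buffer[:length]
        return None
    # Read-modify-write on one byte; the previous value is returned.
    previous = memory[phy_add]
    if op == "increment":
        memory[phy_add] = (previous + 1) & 0xFF
    elif op == "decrement":
        memory[phy_add] = (previous - 1) & 0xFF
    elif op == "store":
        memory[phy_add] = buffer[0]
    elif op == "test_and_set":
        memory[phy_add] = 1
    else:
        raise ValueError(f"unsupported OP: {op!r}")
    return previous
```

Passing the operation to the memory management subsystem, rather
than issuing a separate read followed by a write, allows the
read-modify-write to be applied as one step at the place where the
valid copy of the page resides.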

[0095] For embodiments in which one or more emulators are used as
the virtual machine implementers 134, 136, 138, the above
techniques can also be used. The virtual machine
implementers 134, 136, 138 in such embodiments call the memory
management subsystem 140 each time physical memory access is
needed. An API, MEM_ACCESS(PHY_ADD, RW), provides memory access
for a physical page using its physical address and access type as
a replacement for the CPU paging mechanism used in the virtual
machine monitor.

[0096] It will be appreciated by persons skilled in the art that
the present invention is not limited to what has been
particularly shown and described hereinabove. Rather, the scope
of the present invention includes both combinations and
sub-combinations of the various features described hereinabove,
as well as variations and modifications thereof that are not in
the prior art, which would occur to persons skilled in the art
upon reading the foregoing description.

Appendix 1

[0097] The computer software on the compact disks containing the
computer program listing appendix hereof may be installed and
executed as follows:

Hardware.

[0098] Provide an IBM compatible personal computer with a minimum
of 512 MB RAM, an Intel Pentium IV central processing unit, and
two IDE hard disks with a minimum of 40 gigabytes of disk space.
Each IDE hard disk should be connected to its own individual IDE
controller.

Software (Installation).

Host Operating System (located on the first IDE controlled hard
disk).

[0099] Copy the file .CONFIG in the root folder stored in the
appended CD-ROM into a temporary directory.

[0100] Install the Linux 2.4.20 kernel available from Redhat,
Corporate HQ: 1801 Varsity Drive, Raleigh, NC 27606, USA.

[0101] Install and compile the Linux 2.4.21 kernel patch
available from Kernel Dot Org Organization, 3990 Freedom Circle,
Santa Clara, California 95054, USA, using the .CONFIG file
mentioned above.

[0102] Add the mem=200M argument to the Linux boot command and
reboot the computer.

[0103] Copy the files BIOS.HEX, SCMPVMMO.HEX, SCMPVMMS.HEX and
USERMODE.HEX in the root folder stored in the appended CD-ROM
into a temporary directory.

[0104] Unhex the computer listing BIOS.HEX, SCMPVMMO.HEX,
SCMPVMMS.HEX and USERMODE.HEX using HEXIT V1.8 or greater by John
Augustine, 3129 Earl St., Laureldale, Pa 19605, USA, creating the
files BIOS, SCMPVMM.O, SCMPVMM.SH and USERMODE, respectively.

Guest Operating System (located on the second IDE controlled hard
disk).

[0105] Install the Linux 2.4.20 kernel available from Redhat,
Corporate HQ: 1801 Varsity Drive, Raleigh, NC 27606, USA.

[0106] Install and compile the Linux 2.4.21 kernel patch
available from Kernel Dot Org Organization, 3990 Freedom Circle,
Santa Clara, California 95054, USA, using the above-noted .CONFIG
file.

[0107] Reboot the computer.

Running instructions.

[0108] The system should be run by a user with supervisor
privileges on the Linux system (typically root).

[0109] The system must be run from a text mode screen (not from
within an X-windows terminal) on the host.

[0110] Run the scmpvmm.sh shell script with a single parameter of
start.

[0111] Typically 'sh scmpvmm.sh start'.

[0112] Run the usermode program, typically './usermode'.



